Frontiers in Artificial Intelligence
Frontiers Media SA
Preprints posted in the last 90 days, ranked by how well they match Frontiers in Artificial Intelligence's content profile, based on 18 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Kizilaslan, B.; Mehlum, L.
Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the findings of the XAI-based suicide classification model suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks.
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning
Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.
Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.
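The comparison between attribution magnitude and functional impact described above can be illustrated with a rank correlation. The following is a minimal sketch, not the authors' pipeline: the five attribution scores and ablation drops are hypothetical numbers chosen for illustration, and `average_ranks`, `pearson`, and `spearman` are helper names introduced here.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# HYPOTHETICAL values: attribution score per feature, and the drop in
# model performance when that same feature is ablated.
attribution = [0.40, 0.25, 0.20, 0.10, 0.05]
ablation_drop = [0.01, 0.02, 0.03, 0.08, 0.06]

rho = spearman(attribution, ablation_drop)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative value, as in this toy input, is the pattern the abstract reports: the features with the largest attribution are not the ones whose removal hurts the model most.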
Jayme, A.; Heuveline, V.
Background and Objective: Glioblastoma outcome prediction remains difficult because clinically relevant signals are distributed across heterogeneous imaging and genomic modalities, cohorts are small, and conventional neural predictors do not quantify their own uncertainty. This study evaluates a hybrid neural-Bayesian belief network framework for uncertainty-aware multimodal glioblastoma prediction and examines how modality selection, model family, and structure-aware regularization affect predictive performance and confidence quality. Methods: The framework was evaluated on the TCGA-GBM radiogenomic cohort using four input modalities (T1Gd, FLAIR, mRNA, and CNA), five model families, five structural-weight settings, and 15 view subsets. A secondary benchmark on the UCI Human Activity Recognition dataset was included to assess whether observed limitations were specific to the glioblastoma setting. Results: CNA features consistently reduced performance in most multimodal settings, and selective fusion excluding CNA outperformed both the full four-view baseline and imaging-only alternatives. Model families showed clear differences in uncertainty behaviour: non-Bayesian families achieved the strongest predictive accuracy, whereas the Bayesian family achieved the lowest calibration error over a narrower confidence range. Bayesian belief network regularization produced consistent directional improvements without supporting reliable structure-discovery claims, as learned graph structures were not reproducible across folds. On the secondary benchmark, the same framework achieved much higher predictive performance, indicating that the glioblastoma performance ceiling primarily reflects data limitations rather than an architectural constraint. Conclusions: In small-sample radiogenomic prediction, modality choice is at least as important as model choice, and uncertainty quality differs substantially across uncertainty-aware model families.
The proposed framework provides a practical basis for comparing accuracy, calibration, modality selection, and structure-aware regularization in multimodal biomedical prediction.
Lukhele, N.; Mostafa, F.
Objective: To develop and evaluate a novel machine learning (ML) framework tailored to a clinical diabetes dataset and to assess whether demographic stratification enhances model performance and interpretability for multiclass diabetes classification. Methods: A clinical dataset of 264 patient records was used to classify individuals into non-diabetic, prediabetic, and diabetic categories. Several supervised learning models were trained using an 80:20 train-test split and optimized using RandomizedSearchCV and 10-fold cross-validation. Model performance was evaluated using accuracy, precision, recall, and F1-score. Area under the receiver operating characteristic curve (AUC) was calculated for the best generalizing model. A structured ML framework was developed for this dataset, incorporating preprocessing, model optimization, and stratified analysis by age (<35 vs ≥35 years) and gender. SHAP was used for model interpretability. Results: Ensemble methods demonstrated superior performance in comparison to linear or single-tree approaches, with Gradient Boosting showing the most stable generalization, with a test accuracy of 0.981 and a cross-validation accuracy of 0.972. AUC-ROC analysis using Gradient Boosting yielded good discriminative ability across the three diabetes classes: 0.991 (non-diabetic), 0.986 (prediabetic), and 0.972 (diabetic). Stratified analysis showed improved reliability in individuals aged ≥35 years (accuracy = 0.94, F1-score = 0.92), while performance in younger individuals was unstable due to small sample size. SHAP analysis identified HbA1c, BMI, and age as dominant predictors. Conclusion: This study presents an ML framework integrating age-stratified modelling with explainable ML methods to improve interpretability.
The findings offer clinically relevant results that can support clinical decision-making systems, individualized risk assessment, and potential applications for targeted intervention in diabetes progression.
Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.
Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best performance under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease.
We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
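The gap between discrimination and calibration described above can be made concrete by comparing mean predicted risk with the observed event rate inside probability bins. This is a minimal sketch under stated assumptions: the abstract does not say which calibration metric was used, so the binned expected calibration error below is one common choice rather than the study's method, and the function names and toy inputs are illustrative.

```python
def reliability_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (mean_predicted, observed_rate, count) tuple per
    non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A prediction of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows


def expected_calibration_error(probs, labels, n_bins=5):
    """Count-weighted mean gap between predicted and observed risk."""
    n = len(probs)
    return sum(
        count / n * abs(mean_p - rate)
        for mean_p, rate, count in reliability_bins(probs, labels, n_bins)
    )


# Toy example of risk overestimation: the model says 0.9 for four
# cases, but only half of them are true positives.
overconfident = expected_calibration_error([0.9] * 4, [1, 1, 0, 0])
print(f"ECE = {overconfident:.2f}")
```

A model can rank cases well (high AUROC) and still show a large gap here, which is why the two properties need to be reported separately.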
Wei, M.; Zhang, H.; Peng, Q.
Background: Early initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. Methods: Using the Adolescent Brain Cognitive Development (ABCD) Study® (release 5.1), we developed two complementary multi-task learning (MTL) frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixed-horizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR).
Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. Results: MTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, were consistently identified across all approaches. Conclusions: Dynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
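The step of combining interval-level initiation risk into subject-level cumulative risk can be sketched with the standard discrete-time survival identity, cumulative risk = 1 - ∏(1 - h_t), where h_t is the predicted hazard in interval t. The abstract does not state the exact aggregation used, so this identity is an assumption, and `cumulative_risk` and the example hazards are illustrative.

```python
def cumulative_risk(interval_hazards):
    """Combine per-interval hazards into cumulative initiation risk.

    Each hazard is the probability of initiating in that interval,
    given no initiation in earlier intervals.
    """
    survival = 1.0
    for h in interval_hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("hazards must lie in [0, 1]")
        survival *= 1.0 - h  # probability of still not having initiated
    return 1.0 - survival


# HYPOTHETICAL hazards for four follow-up intervals of one participant.
risk = cumulative_risk([0.02, 0.03, 0.05, 0.08])
print(f"cumulative risk = {risk:.4f}")
```

Because the hazards are estimated per interval, time-varying exposures can change the prediction at each step, which a single baseline record cannot do.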
Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.
Background: Machine-learning models based on circulating biomarkers are increasingly used in cardiovascular research; however, model performance alone provides limited insight into how the predictive signal is distributed across features. We aimed to characterize the biomarker signal architecture of a machine-learning model distinguishing ST-elevation myocardial infarction (STEMI) from non-ST-elevation myocardial infarction (NSTEMI), with a focus on signal concentration, redundancy, and conditional complementarity. Methods: We conducted a structured secondary analysis of a previously established, leakage-controlled machine-learning framework (n = 152 patients). The BIOMARKERS feature-set variant (10 biomarkers) was evaluated using outer-fold cross-validation. Model structure was interrogated using (i) leave-one-biomarker-out analysis, (ii) pairwise leave-two-out analysis with pair-excess estimation, (iii) cumulative ablation of top-ranked biomarkers, and (iv) forward reconstruction of minimal biomarker panels. Uncertainty was assessed using bootstrap resampling across folds. Results: The full biomarker model achieved a mean ROC-AUC approaching 0.94. The predictive signal was highly non-uniform, with MMP-2 showing the largest single-feature contribution (mean ΔAUC ≈ 0.16). Pairwise analysis identified conditional complementarity between selected non-lipid biomarkers, particularly MMP-2 and EMMPRIN (pair ΔAUC ≈ 0.26; positive excess over single-feature effects), whereas lipid-related markers formed a highly correlated and largely redundant sub-cluster. Cumulative ablation demonstrated rapid performance collapse following removal of top-ranked biomarkers, consistent with structural signal concentration. Forward panel analysis showed that a compact subset of biomarkers (three features) achieved performance within ~0.01 ROC-AUC of the full model, indicating the presence of a minimal high-yield panel.
Bootstrap confidence intervals suggested that small performance differences should be interpreted with caution. Conclusions: Predictive performance in this biomarker-based model arises from a structured and unevenly distributed signal architecture, characterized by a dominant core biomarker, conditionally complementary contributors, and a redundant lipid cluster. These findings highlight the importance of evaluating model structure, not only aggregate performance, and suggest that biomarker-based machine-learning systems may benefit from architecture-aware interpretation and simplification strategies.
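The pair-excess idea in the leave-two-out analysis can be written as a one-line calculation. This sketch assumes that "pair excess" means the AUC loss from removing both biomarkers minus the sum of the two single-removal losses; the abstract does not give the formula. The MMP-2 and pair values are the approximate figures reported above, while the EMMPRIN single-feature value is hypothetical because the abstract does not report it.

```python
def pair_excess(delta_pair, delta_a, delta_b):
    """AUC loss of removing both features beyond the sum of single losses.

    A positive value suggests the two features are complementary;
    a negative value suggests redundancy.
    """
    return delta_pair - (delta_a + delta_b)


delta_mmp2 = 0.16     # approximate value reported in the abstract
delta_pair = 0.26     # approximate value reported in the abstract
delta_emmprin = 0.05  # HYPOTHETICAL: not reported in the abstract

excess = pair_excess(delta_pair, delta_mmp2, delta_emmprin)
print(f"pair excess = {excess:+.2f}")
```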
Serrano, A. E.
Machine learning (ML) has emerged as a transformative technology across biomedical and life science sectors, with applications spanning drug discovery, medical imaging, genomics, and clinical decision support (Goecks et al., 2020; Patel et al., 2020). Despite exponential growth in ML-related publications, from fewer than 100 articles in 2003 to nearly 25,000 by 2021 (NCBI, 2022), adoption among industry professionals remains uneven and sector-dependent. Understanding what drives or inhibits this adoption is critical for organisations seeking to leverage ML capabilities in research and clinical practice. Technology adoption in organisational contexts has been extensively studied through the Technology Acceptance Model (TAM), originally proposed by Davis (1989) and subsequently extended to incorporate external variables influencing perceived usefulness (PU) and perceived ease of use (PEU) (Venkatesh & Davis, 1996). While TAM has been applied across multiple industries, its application within biomedical and life science contexts remains limited, and the industry-specific factors that shape ML acceptance in this sector have not been systematically examined. Two external variables are particularly relevant to life science professionals. First, the bibliometric journal impact factor (JIF) functions as a cognitive signal of scientific credibility in a sector where evidence-based decision-making is culturally embedded and publication quality serves as a proxy for technological legitimacy (Garfield, 1996). Second, technology hype, operationalised through the Gartner Hype Cycle framework, represents a social influence variable that shapes organisational expectations and investment decisions around emerging technologies (Gartner Inc., 2018). Whether these variables influence ML acceptance among life science professionals, alongside individual knowledge and experience, has not been empirically tested.
This study addresses that gap by investigating ML technology acceptance among 213 biomedical and life science professionals across EMEA, LATAM, and North America, using a cross-sectional quantitative survey and PLS-SEM analysis. The TAM model is extended with three external variables, JIF, technology hype, and prior knowledge and experience, to test their influence on PU and PEU in this specific professional context. Additionally, the study examines demographic and regional differences in ML acceptance, with particular attention to variation between academic researchers and healthcare professionals. The findings contribute a validated, sector-specific extension of TAM for life sciences, provide actionable insights for organisations seeking to accelerate ML implementation, and establish a framework for future subsector-specific research.
Pandey, A. K.
Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty. Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models: a classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80), trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation: two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance. Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0).
The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, ε²=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman ρ=0.440, p=0.024; Kendall τ=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods. Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.
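The Wilson score intervals quoted for the death audit can be checked directly from the counts. The sketch below uses the standard Wilson formula with a two-sided 95% z-value; `wilson_interval` is a name introduced here, and the counts (36 of 52 deaths detected, 36 of 36 alerts correct) are taken from the abstract.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score confidence interval for a binomial proportion.

    The default z is the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Audit sensitivity: 36 of 52 deaths identified.
sens_lo, sens_hi = wilson_interval(36, 52)
# Audit precision: 36 of 36 alerts were true deaths.
prec_lo, prec_hi = wilson_interval(36, 36)

print(f"sensitivity 95% CI: {sens_lo:.1%} to {sens_hi:.1%}")
print(f"precision   95% CI: {prec_lo:.1%} to {prec_hi:.1%}")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed proportion is exactly 100%, which is why the precision interval has a lower bound near 90% rather than collapsing to a point.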
Hakata, Y.; Oikawa, M.; Fujisawa, S.
Background: Federated learning (FL) enables collaborative model training across institutions without sharing patient-level data. However, standard FL algorithms such as FedAvg degrade under non-independently and non-identically distributed (non-IID) data, a prevalent condition when patient demographics, scanner hardware, and disease prevalence differ across hospital sites. Objective: We propose iPS-MFFL (Individualized Per-Site Meta-Federated Feature Learning), a federated framework with a hierarchical local-model architecture that addresses non-IID heterogeneity through (1) a shared feature extractor, (2) multiple weak-learner classification heads that can be trained with heterogeneous training objectives to promote complementary decision boundaries, (3) independent per-learner server aggregation so that each weak learner's parameters are averaged only with its counterparts at other clients, and (4) a lightweight meta-model -- itself federated -- that adaptively stacks the weak-learner outputs. The specific choices of backbone, weak-learner training objectives, and meta-model are implementation details; in this work we use an ImageNet-pretrained ResNet18 and three heterogeneous losses as a concrete instantiation. Methods: We evaluate on the Brain Tumor MRI Classification dataset (7,200 images; 4 classes: glioma, meningioma, pituitary tumor, no tumor) partitioned across K = 5 simulated hospital sites using Dirichlet non-IID sampling (concentration parameter = 0.3). Four baselines are compared: Local-only training, FedAvg, FedProx, and Freeze-FT. All experiments are repeated over three random seeds (13, 42, 2025) and evaluated using paired t-tests, Cohen's d effect sizes, and post-hoc power analysis. Results: iPS-MFFL achieved the highest mean final-round test accuracy point estimate of 85.42 ± 8.70% (mean ± SD across three seeds), compared to FedAvg (78.48 ± 12.66%), FedProx (78.33 ± 14.64%), Freeze-FT (73.98 ± 21.09%), and Local (58.10 ± 11.77%).
iPS-MFFL showed the smallest cross-seed SD, suggesting greater robustness to partition heterogeneity. However, one-way ANOVA did not reach statistical significance (F = 1.52, p = 0.270), reflecting the limited number of experimental seeds. Cohen's d effect sizes relative to iPS-MFFL ranged from 0.59 (vs. FedProx) to 2.64 (vs. Local); post-hoc pairwise comparisons are reported as exploratory given the non-significant omnibus test. Post-hoc power analysis indicated that statistical power for FL baseline comparisons was only 0.10-0.12 for the observed effect sizes (d ≈ 0.6) at n = 3 seeds. Conclusions: iPS-MFFL provides a practical approach to heterogeneous federated brain tumor classification by combining transfer learning, contrastive weak-learner diversity, and meta-learning. The framework demonstrated the highest mean accuracy and lowest variance across diverse data partitions. Validation with larger seed pools (≥ 10 seeds for 80% power), ablation studies, and external multi-center cohorts is needed to establish generality.
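Dirichlet non-IID sampling, as used to simulate the hospital sites above, can be sketched with the standard library alone. This is an illustration of the general technique, not the authors' partitioning code: for each class, site shares are drawn from a symmetric Dirichlet distribution and that class's samples are split accordingly. The function names, the toy label list, and the choice of seed are assumptions; only K = 5 sites and the value 0.3 come from the abstract.

```python
import random


def dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against underflow for very small alpha
        return [1.0 / k] * k
    return [d / total for d in draws]


def partition_non_iid(labels, n_sites, alpha, seed=0):
    """Split sample indices across sites with class-wise Dirichlet shares.

    Smaller alpha gives more skewed (more heterogeneous) sites.
    """
    rng = random.Random(seed)
    sites = [[] for _ in range(n_sites)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        shares = dirichlet(alpha, n_sites, rng)
        start, cumulative = 0, 0.0
        for site in range(n_sites):
            cumulative += shares[site]
            if site == n_sites - 1:
                end = len(indices)  # last site takes the remainder
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            end = max(end, start)
            sites[site].extend(indices[start:end])
            start = end
    return sites


# Toy data: 200 samples over 4 classes (the real dataset has 7,200 images).
toy_labels = [i % 4 for i in range(200)]
site_indices = partition_non_iid(toy_labels, n_sites=5, alpha=0.3, seed=13)
for s, members in enumerate(site_indices):
    counts = [sum(1 for i in members if toy_labels[i] == c) for c in range(4)]
    print(f"site {s}: class counts {counts}")
```

With a concentration value as low as 0.3, most sites end up dominated by one or two classes, which is the condition under which plain parameter averaging tends to degrade.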
Fixman, M.; Abati, A.; Jimenez Nimo, J.; Lim, S.; Mondragon, E.
In contrast to static formalisms, computational definitions describe the operational mechanisms of a model. Simulations are an essential part of the cycle of theory development and refinement, assisting researchers in formulating the precise definitions that models require, and making accurate predictions. This manuscript introduces a computational implementation of Pavlovian learning models in a Python environment, termed Pavlovian Associative Learning Models Simulation (PALMS). In addition to the canonical Rescorla-Wagner model, attentional approaches are implemented, including Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley's Hybrid, and a novel extension of the Rescorla-Wagner model featuring a unified variable learning rate that synthesises Mackintosh's and Pearce and Hall's opposing conceptualisations. To our knowledge, only the first attentional model has been previously specified computationally in a general design tool. PALMS integrates a graphical interface that permits the input of entire experimental designs in an alphanumeric format, akin to that used by experimental neuroscientists. It uniquely enables the simulation of experiments involving hundreds of stimuli, such as those used with human participants, and the computation of configural cues and configural-cue compounds across all models, thereby substantially broadening their predictive capabilities. A comprehensive description of the models' implementation and the environment functionalities is provided in the paper; these include efficient and accurate operation and instant visualisation of predicted results across different models within a single architecture and environment.
We evaluate PALMS by simulating five published experiments in the associative learning literature that assessed the predictive scope of existing models, and we show that this implementation provides neuroscientists with a useful tool for identifying critical variables, refining experimental designs, making precise predictions, comparing model fitness, and formulating new theoretical approaches. PALMS is licensed under the open-source GNU Lesser General Public License 3.0. The environment source code and the latest multiplatform release build are accessible as a GitHub repository at https://github.com/cal-r/PALMS-Simulator. Author summary: Research on associative learning is multidisciplinary, encompassing disciplines such as neuroscience, AI, psychology, psychiatry, behavioural sciences, planning, and marketing. Unlike static formalisms, precise computational definitions specify how a model operates, enabling model simulation and swift, error-free prediction calculations, which are essential for testing theories, comparing predictions, holding models accountable, and providing a common language across fields. We introduce Pavlovian Associative Learning Models Simulation (PALMS), a user-friendly, open-source Python environment for simulating classical conditioning and studying the role of attention in learning. PALMS implements the prescriptive Rescorla-Wagner and attentional models: Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley's Hybrid, and a new hybrid model with a unified variable learning rate that blends Mackintosh's and Pearce-Hall's conflicting views. Its graphical interface makes it easy for neuroscientists to enter experiments. Our computational implementation supports simulations with hundreds of stimuli, configural cues, and compounds, broadening the models' predictive power. Designed for efficiency, it offers instant visual results and useful features.
We evaluate PALMS by simulating five published experiments, highlighting its value for model comparison and refinement, and, more generally, as a tool to assist research.
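For readers unfamiliar with the canonical model named above, the Rescorla-Wagner update changes the associative strength of each cue present on a trial by ΔV = αβ(λ − ΣV), where ΣV is the summed strength of all cues present. The sketch below is a minimal textbook version and is not taken from the PALMS source code; the function name, the single shared β, and the parameter values are assumptions made for illustration.

```python
def rescorla_wagner(trials, alphas, beta=1.0, lam=1.0):
    """Simulate the Rescorla-Wagner model over a sequence of trials.

    trials: list of (cues_present, reinforced) pairs.
    alphas: salience (learning-rate) parameter per cue.
    Returns the final strengths and the per-trial history.
    """
    strengths = {cue: 0.0 for cue in alphas}
    history = []
    for cues, reinforced in trials:
        target = lam if reinforced else 0.0
        # One shared prediction error for all cues present on the trial.
        error = target - sum(strengths[c] for c in cues)
        for c in cues:
            strengths[c] += alphas[c] * beta * error
        history.append(dict(strengths))
    return strengths, history


# Blocking design: cue A is trained alone, then the compound AB is trained.
design = [(("A",), True)] * 20 + [(("A", "B"), True)] * 20
final, _ = rescorla_wagner(design, alphas={"A": 0.5, "B": 0.5})
print(f"V(A) = {final['A']:.3f}, V(B) = {final['B']:.3f}")
```

Because A already predicts the outcome by the time B is introduced, the shared error term is close to zero and B acquires almost no strength, which reproduces the classic blocking effect.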
Brulhart, D.; Magini, G.; Schafer, A.; Schwab, S.; Held, U.
Objectives: Clinical prediction models estimate the risk of a future outcome in patients. Such models are often externally validated using independent datasets; however, even when a model has been rigorously validated in a new setting and patient population, its performance across other clinical settings remains unclear. Therefore, we systematically evaluated model performance and clinical utility across diverse patient populations to quantify the limits of transportability. Methods: Using liver transplantation as an example, we used the UK donation-after-circulatory-death (DCD) risk score and descriptive statistics from Swiss DCD liver transplant populations to simulate realistic target populations with varying donor and recipient characteristics. The risk score's ability to predict one-year graft failure was evaluated using calibration intercept, calibration slope, area under the receiver operating characteristic (ROC) curve, and net benefit. Results: The UK DCD Risk Score's performance depended heavily on the simulated population characteristics. While the score performed adequately in settings similar to those where it was derived, it was not satisfactory in others. Discussion: The study showed, using a risk score in liver transplantation as an example, that the application of a prediction model can be limited in certain external populations when they differ, and that its transportability in new settings is not guaranteed. Conclusion: This study highlights the importance of external validation of clinical prediction models to determine transportability to various target populations. Their application requires careful consideration and potential model re-estimation.
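Of the four measures listed above, net benefit is the one that ties model output to a clinical decision threshold. It is computed as TP/n − (FP/n) × p_t/(1 − p_t), where p_t is the risk threshold above which a patient would be treated. The sketch below uses that standard decision-curve formula; the function names and the four toy patients are illustrative and are not data from the study.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient regardless of prediction."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# HYPOTHETICAL predicted risks and observed outcomes for four patients.
toy_probs = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
nb_model = net_benefit(toy_probs, toy_labels, threshold=0.2)
nb_all = net_benefit_treat_all(toy_labels, threshold=0.2)
print(f"model: {nb_model:.4f}, treat all: {nb_all:.4f}")
```

A model is only useful at a given threshold if its net benefit exceeds both the treat-all value and zero (treat none), and a shift in case mix between populations can change that comparison even when discrimination is unchanged.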
Dahlberg, A. C. H.; Tapiola, O.; Luisto, R.; Puranen, T.; Sanmark, E.; Vartiainen, V.
Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records. Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records. Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6-1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding. 
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
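The three vector comparisons named in the abstract above (cosine similarity, L1 distance, and Euclidean distance) can be sketched as follows. This is a minimal illustration in plain Python; the function names and the toy vectors are assumptions, and real embedding vectors would come from one of the evaluated models rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy stand-ins for the embeddings of an original and a perturbed text.
original = [0.2, 0.7, 0.1, 0.4]
perturbed = [0.1, 0.6, 0.3, 0.4]

# Comparing only the first k coordinates is one simple way to form the
# "partial vectors" mentioned in the abstract (an assumption, not the
# authors' stated procedure).
k = 2
partial_cosine = cosine_similarity(original[:k], perturbed[:k])
```

A larger distance (or a lower cosine similarity) between the original and perturbed vectors is read as a larger semantic change.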
Alsammani, A.; Johnson, M.; Elrefaei, J.
Objective: To develop, calibrate, and interpret machine learning models for predicting in-hospital mortality among intensive care unit (ICU) patients using clinical data collected during the first 24 hours of admission. Methods: We analyzed 53,866 adult ICU admissions from the MIMIC-IV (v2.2) database, including 5,787 in-hospital deaths (10.7%). An enhanced feature-engineering pipeline generated 88 laboratory-based features that captured distributional characteristics, temporal trends, and measurement frequency. Five machine learning classifiers were evaluated: L2-regularized logistic regression, random forest, XGBoost, LightGBM, and a calibrated soft-voting ensemble. Models were developed using a stratified 64:8:8:20 split for training, validation and hyperparameter tuning, calibration, and testing. Performance was assessed on a held-out test set (n = 10,774) using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), Brier score, calibration analysis, decision curve analysis (DCA), and SHAP-based model interpretation. Results: The calibrated ensemble achieved the best overall performance, with an AUROC of 0.856 (95% CI: 0.846-0.867), an AUPRC of 0.449 (95% CI: 0.418-0.480), and a Brier score of 0.078. XGBoost (AUROC 0.856; AUPRC 0.435) and LightGBM (AUROC 0.854; AUPRC 0.436) demonstrated performance comparable to the ensemble and significantly outperformed logistic regression (AUROC 0.823; AUPRC 0.376), yielding absolute AUROC improvements of approximately 0.031-0.033 (p < 0.001). Calibration substantially improved probabilistic predictions, reducing Brier scores by 42% for XGBoost (0.134 to 0.078) and 50% for LightGBM (0.151 to 0.076). Decision curve analysis demonstrated consistent net clinical benefit across the 5%-20% risk-threshold range. Key predictors included age, blood urea nitrogen, ICU subtype, measurement frequency, and lactate-related features. 
Model performance remained robust across ICU subtypes, with AUROC values exceeding 0.79. Conclusion: A calibrated and interpretable machine learning framework based on early ICU clinical data provides accurate and clinically actionable mortality risk estimates. By integrating trajectory-aware feature engineering, probabilistic calibration, and decision-analytic evaluation, this approach advances ICU mortality prediction toward more reliable and trustworthy clinical decision support systems.
Patel, K.; Beedala, P.
Background: Machine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment, a gap that undermines clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is a prerequisite for threshold-based decisions yet remains underreported. Methods: We conducted a retrospective cohort study using MIMIC-IV (v2.2; n = 52,028 ICU stays) for model development and eICU (n = 114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via receiver operating characteristic area (AUROC) and precision-recall area (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post-hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines. Results: The recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832-0.860) and external AUROC 0.819 (95% CI: 0.815-0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept -0.678) attributable to prevalence-driven label shift. An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p < 0.001). Conclusions: ICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. 
Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.
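For readers unfamiliar with expected calibration error (ECE), a minimal sketch follows. It assumes equal-width probability bins, which is a common convention; the abstract does not state which binning scheme the authors used, and the function name and toy data are illustrative only.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins (an assumed convention, not the study's code).

    For each bin, take the absolute gap between the observed event rate and
    the mean predicted probability, then weight it by the bin's share of
    all samples.
    """
    n = len(y_true)
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece


# Toy example: predictions that are slightly too cautious at both extremes.
toy_ece = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
```

A systematic overestimation of risk, such as the negative external calibration intercept reported above, would show up here as bins whose mean predicted probability exceeds the observed event rate.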
Killekar, A.; Shanbhag, A.; Miller, R. J.; Dey, D.; Bourque, J.; Phillips, L.; Chareonthaitawee, P.; Slomka, P.
Background: Previous studies evaluated large language model (LLM) performance on the American Society of Nuclear Cardiology (ASNC) Board Preparation Exam. Without domain-specific context, the best model (GPT-4o) achieved 63.1%, below the estimated 65% passing threshold and the 78% mean score of human fellows-in-training (FITs). Providing textbook context improved GPT-4o to 73.8% on text-only questions, but still fell short of human trainees. Whether next-generation LLMs with retrieval-augmented generation (RAG) can close this gap is unknown. Methods: Claude Opus 4.7 and GPT-5.5 were administered all 168 questions (141 text-only, 27 image-based) from the 2023 ASNC Board Preparation Exam across 5 iterations each, using RAG with a nuclear cardiology textbook, companion atlas, and ASNC clinical guidelines. Claude used local FAISS-based semantic retrieval; GPT-5.5 used Azure's cloud-hosted vector store. Performance was compared to prior LLM results and 13 human FITs. Results: Across 5 iterations, Claude Opus 4.7 achieved a mean accuracy of 86.3% ± 1.4% (text 88.8%, image 73.3%). GPT-5.5 achieved 86.7% ± 2.2% (text 88.5%, image 77.0%) but refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. Both models surpassed the human FIT mean (78.0%) and the estimated passing threshold. Compared to GPT-4o without context (63.1%), this represents a 23-percentage-point improvement in 18 months. Conclusion: Next-generation LLMs with RAG now surpass average human trainee performance on nuclear cardiology board preparation questions, suggesting significant potential as educational tools and knowledge-reference aids in cardiovascular imaging. Condensed Abstract: Across 5 iterations each, Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation achieved mean accuracies of 86.3% and 86.7% on the 2023 ASNC Board Preparation Exam (168 questions), both surpassing the mean human fellow-in-training score of 78%. 
GPT-5.5 refused a mean of 12.2 questions (7.3%) per iteration due to safety filters. These results represent a 23-percentage-point improvement over the best prior LLM without context (63.1%), demonstrating that RAG-enhanced LLMs have reached human-level proficiency in nuclear cardiology knowledge. Graphical Abstract: Overview of the three-study research arc evaluating LLM performance on the 2023 ASNC Board Preparation Exam. Study 1 (2024) tested four LLMs without context (best: GPT-4o, 63.1%). Study 2 (2025) added textbook context to GPT-4o (73.8%). Study 3 (2026, current) evaluated Claude Opus 4.7 and GPT-5.5 with retrieval-augmented generation across 5 iterations each (mean 86.3% and 86.7%, respectively), both surpassing the human fellow-in-training mean of 78%. Right panel shows the performance scale with key thresholds.
Holen, A. S.; Larsen, M.; Hofvind, S.
Background and Objective: Increasing screening volumes, combined with a global shortage of radiologists and a high proportion of normal mammograms, challenge the efficiency and sustainability of breast cancer screening. Artificial intelligence (AI) has the potential to improve resource allocation, workflow efficiency and diagnostic performance by supporting and partially replacing radiologists in the interpretation process. This randomized, controlled, parallel-group, non-inferiority, single-blinded trial evaluates whether an AI-supported reading strategy, involving one or two radiologists depending on AI risk stratification, is non-inferior to standard independent double reading. The primary outcome is the number of screen-detected breast cancer cases in each group. Methods: Women invited to BreastScreen Norway in the Western, Central, and Northern Norway Regional Health Authorities are eligible for inclusion. Following written informed consent, participants are randomized 1:1 to the control group (standard independent double reading by two radiologists) or the intervention group. In the intervention group, mammograms are analyzed using Transpara. Examinations with AI scores of 1-7 are interpreted by a single radiologist, whereas examinations with scores of 8-10 undergo independent double reading. Radiologists are blinded to AI scores and AI image markings during the initial interpretation; this information is disclosed during consensus meetings. Non-inferiority will be assessed by estimating the confidence interval for the difference in screen-detected cancer rates between groups. Non-inferiority will be concluded if the upper bound of the confidence interval does not exceed the predefined non-inferiority margin. Conclusions: The trial addresses a critical challenge in breast cancer screening: maintaining diagnostic performance while improving efficiency in the context of workforce constraints and a high prevalence of normal examinations. 
By evaluating a risk-stratified AI-supported reading strategy within a population-based screening program, the study will provide important evidence on whether AI can be safely integrated to optimize workload distribution while preserving cancer detection rates. Trial registration: ClinicalTrials.gov registry (NCT06032390).
Choi, J.; Kim, Y. J.; Lyu, P.; Luan, Y. L.; Toh, S. M.
Artificial intelligence (AI) is increasingly incorporated into diagnostic decision-making, raising questions about physician responsibility following AI-involved adverse diagnostic events. Explainable AI (XAI) has been proposed to improve transparency and trust, but its influence on public reactions remains unclear. In a randomised vignette-based experiment, 652 adults from the United States and United Kingdom were assigned to one of six conditions in a 3 (diagnostic source: AI alone, human radiologist alone, or human-AI collaboration) x 2 (explanation: present or absent) between-subjects design. Participants read a scenario in which a chest X-ray was initially interpreted as normal but lung cancer was diagnosed five months later, indicating that the original interpretation had missed the cancer. In explanation conditions, participants received additional information about how the diagnosis had been reached, including AI heatmap-based explanations in the AI conditions. Participants rated radiologist responsibility, likelihood of complaint, and intention to pursue legal action. Among 652 participants (mean age 42.2 years; 50.2% female), responsibility ratings were significantly lower when AI alone made the diagnostic decision (mean 4.73, 95% CI 4.53-4.93) compared with human-only decision-making (5.78, 95% CI 5.59-5.98; p<0.001) and human-AI collaboration (5.54, 95% CI 5.34-5.74; p<0.001). Complaint likelihood showed a similar pattern. Intentions to pursue legal action followed the same directional trend but were marginally significant. Neither explanations nor explanation-by-source interactions were associated with outcome measures. These findings suggest that the public expects physicians to remain accountable when AI is involved in diagnostic decision-making, particularly in collaborative settings. 
Providing explanatory information about how AI systems reach decisions may be insufficient to change perceptions of physician responsibility following adverse diagnostic events.
Overmars, L. M.; Allaart, C.; Bron, E. E.; Brunner La Rocca, H.-P.; de Bresser, J.; Muller, M.; van Osch, M. J. P.; Teunissen, C.; Tijms, B. M.; Wolters, F. J.; Biessels, G. J.; Heart-Brain Connection Consortium
Background: Vascular cognitive impairment (VCI) and small vessel disease (SVD) involve many interconnected factors influencing multiple outcomes, also beyond cognitive decline. Bayesian networks (BNs) can help unravel these complex interrelations, which we demonstrate in this proof-of-concept study in the Heart-Brain Connection cohort, including memory-clinic patients with SVD, patients with heart failure, carotid occlusive disease, and reference participants. Methods: We trained BNs and jointly modelled cognitive decline (Clinical Dementia Rating (CDR) increase) and major adverse cardiovascular events (MACE) over five years as outcomes in relation to multiple demographic and disease factors and emerging imaging and plasma biomarkers, also considering possible non-random dropout. Results: Of 566 individuals (median age 68, 64% men), 134 had MACE and 112 experienced CDR increase. Diagnostic group and baseline cognition were key determinants of both outcomes. The BN identified baseline clinical severity as a non-random dropout source. Plasma biomarkers formed an interconnected subnetwork, linked to demographic and vascular factors, but without direct dependencies with outcomes. The trained BN also provides individualized inference under partial evidence, informing on outcome probabilities. Conclusion: This proof-of-concept study demonstrates how BNs quantify and visualize the dependency structure underlying prognostic heterogeneity in VCI and SVD, including non-random dropout and positioning of emerging biomarkers.
Jean, A.; Merceron, A.; Le Saux, A.; Mercier, E.; Benillouche, P.
This study aims to assess women's perceptions of artificial intelligence (AI) used in breast cancer screening in France by examining their knowledge of AI and the barriers to their participation in organized screening. The results of a survey conducted in June 2025 among a national sample of 2000 women (aged 40-75) reveal limited participation and persistent concerns among women. Nevertheless, despite a low awareness of specific AI applications, a large majority of the women surveyed are very favorable to the use of AI in breast cancer diagnosis, even considering it a lever to increase screening participation.